challenge_2
Author

Young Soo Choi

Published

August 16, 2022

Code
library(tidyverse)
library(readxl)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to

  1. read in a data set, and describe the data using both words and any supporting information (e.g., tables, etc)
  2. provide summary statistics for different interesting groups within the data, and interpret those statistics

Read in the Data

I chose the file ‘StateCounty2012.xls’ and read it in. The first 3 rows of the sheet hold a title and other non-data text, so I skipped them with the skip argument of read_xls().

Code
library(readxl)
state2012 <- read_xls("_data/StateCounty2012.xls", skip=3)
state2012
# A tibble: 2,990 × 5
   STATE     ...2  COUNTY               ...4  TOTAL
   <chr>     <lgl> <chr>                <lgl> <dbl>
 1 AE        NA    APO                  NA        2
 2 AE Total1 NA    <NA>                 NA        2
 3 AK        NA    ANCHORAGE            NA        7
 4 AK        NA    FAIRBANKS NORTH STAR NA        2
 5 AK        NA    JUNEAU               NA        3
 6 AK        NA    MATANUSKA-SUSITNA    NA        2
 7 AK        NA    SITKA                NA        1
 8 AK        NA    SKAGWAY MUNICIPALITY NA       88
 9 AK Total  NA    <NA>                 NA      103
10 AL        NA    AUTAUGA              NA      102
# … with 2,980 more rows
# ℹ Use `print(n = ...)` to see more rows
Code
colnames(state2012)
[1] "STATE"  "...2"   "COUNTY" "...4"   "TOTAL" 

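This dataset has 2,990 rows and 5 columns, named “STATE”, “...2”, “COUNTY”, “...4”, and “TOTAL”. The names “...2” and “...4” are placeholders that readxl assigned to two unnamed columns, and both columns are entirely NA.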
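Since two of the five columns are empty and some rows are state subtotals, a light cleaning step can help before summarising. The following is a minimal sketch, not part of the original analysis. It runs on a small toy tibble (`sample_rows`, a hypothetical name) whose values are copied from the printout above, with the two empty columns left out, and it assumes that subtotal rows can be recognised by a missing COUNTY.

```r
library(dplyr)

# Toy stand-in for a few rows of state2012 (values copied from the printout above).
# The two all-NA columns are omitted here; on the real tibble the select() call
# below would drop them.
sample_rows <- tibble(
  STATE  = c("AE", "AE Total1", "AK", "AK", "AK Total", "AL"),
  COUNTY = c("APO", NA, "ANCHORAGE", "JUNEAU", NA, "AUTAUGA"),
  TOTAL  = c(2, 2, 7, 3, 103, 102)
)

clean_rows <- sample_rows %>%
  select(STATE, COUNTY, TOTAL) %>%  # keep only the columns that carry data
  filter(!is.na(COUNTY))            # assumption: subtotal rows have no county name

clean_rows
```

On the full data the same two steps could be applied to state2012 directly, after confirming that every row without a county name really is a subtotal or a note.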

Describe the data

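I think this dataset records the number of railroad workers by state and county in 2012. Although it has 2,990 rows, not every row is a county: some rows are state subtotals (for example “AK Total”) with no county name, and 5 rows have a missing TOTAL. So the number of counties is smaller than 2,990. The maximum TOTAL of 255,432 is far above the third quartile of 71, which suggests it comes from a total row rather than from a single county.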

Code
summary(state2012)
    STATE             ...2            COUNTY            ...4        
 Length:2990        Mode:logical   Length:2990        Mode:logical  
 Class :character   NA's:2990      Class :character   NA's:2990     
 Mode  :character                  Mode  :character                 
                                                                    
                                                                    
                                                                    
                                                                    
     TOTAL         
 Min.   :     1.0  
 1st Qu.:     7.0  
 Median :    22.0  
 Mean   :   256.9  
 3rd Qu.:    71.0  
 Max.   :255432.0  
 NA's   :5         
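The row count alone does not say how many counties the file covers, because subtotal rows are mixed in with county rows. One way to check is to label each row and count the labels. This is only a sketch on a toy tibble (`toy` and `row_type` are hypothetical names, and the values are copied from the first rows printed earlier). It assumes that a missing COUNTY marks a subtotal or note row.

```r
library(dplyr)

# Toy rows copied from the head of state2012
toy <- tibble(
  STATE  = c("AE", "AE Total1", "AK", "AK", "AK Total"),
  COUNTY = c("APO", NA, "ANCHORAGE", "JUNEAU", NA),
  TOTAL  = c(2, 2, 7, 3, 103)
)

row_types <- toy %>%
  # assumption: rows without a county name are subtotals or notes
  mutate(row_type = if_else(is.na(COUNTY), "subtotal or note", "county")) %>%
  count(row_type)

row_types
```

Running the same pipeline on state2012 would give the number of county rows directly.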

Provide Grouped Summary Statistics

I chose MA and CA because I live in MA and because CA is, I think, a much bigger state. I used filter() to subset each state and summary() to get MA’s and CA’s measures of central tendency.

Code
MA<-filter(state2012, STATE=="MA")
summary(MA)
    STATE             ...2            COUNTY            ...4        
 Length:12          Mode:logical   Length:12          Mode:logical  
 Class :character   NA's:12        Class :character   NA's:12       
 Mode  :character                  Mode  :character                 
                                                                    
                                                                    
                                                                    
     TOTAL      
 Min.   : 44.0  
 1st Qu.:101.8  
 Median :271.0  
 Mean   :281.6  
 3rd Qu.:396.8  
 Max.   :673.0  
Code
CA<-filter(state2012, STATE=="CA")
summary(CA)
    STATE             ...2            COUNTY            ...4        
 Length:55          Mode:logical   Length:55          Mode:logical  
 Class :character   NA's:55        Class :character   NA's:55       
 Mode  :character                  Mode  :character                 
                                                                    
                                                                    
                                                                    
     TOTAL       
 Min.   :   1.0  
 1st Qu.:  12.5  
 Median :  61.0  
 Mean   : 238.9  
 3rd Qu.: 200.5  
 Max.   :2888.0  
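The two filter() and summary() calls can also be written as a single grouped summary, which puts both states side by side in one table. This is a sketch of that alternative, not the method used above. The tibble `toy` and its county names are hypothetical, and the TOTAL values are only illustrative (they reuse the minimum, median, and maximum printed for each state).

```r
library(dplyr)

# Illustrative rows only: county names are placeholders, TOTAL values reuse
# the min / median / max shown in the summaries above
toy <- tibble(
  STATE  = c("MA", "MA", "MA", "CA", "CA", "CA"),
  COUNTY = c("county_1", "county_2", "county_3",
             "county_4", "county_5", "county_6"),
  TOTAL  = c(44, 271, 673, 1, 61, 2888)
)

by_state <- toy %>%
  group_by(STATE) %>%
  summarise(
    n_counties   = n(),            # number of county rows per state
    mean_total   = mean(TOTAL),
    median_total = median(TOTAL),
    max_total    = max(TOTAL),
    .groups = "drop"
  )

by_state
```

With the real data, filter(state2012, STATE %in% c("MA", "CA")) could replace the toy tibble, and the same summarise() call would return one row per state.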

Explain and Interpret

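MA has 12 counties with a mean TOTAL of 281.6, and CA has 55 counties with a mean TOTAL of 238.9. So CA has more counties than MA, but MA’s mean per county is larger than CA’s. The gap is even wider in the medians (271.0 for MA versus 61.0 for CA). CA’s mean is pulled up by a few very large counties (its maximum is 2,888), while most CA counties have far fewer workers than a typical MA county.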